One Way to Create Define.xml Files

 

Implementation, Obstacles and Enhancements

 

 

 

 

 

Authors            Emöke Merli, Edith Heimsch, Dr. Elke Sennewald

 

Company         Kendle International Inc.

Stefan-George-Ring 6

81929 München

 

Contact            Emöke Merli

                        Stefan-George-Ring 6

                        81929 München

                        Tel. +49 (0) 89 / 99 39 13 181

                        Fax +49 (0) 89 / 99 39 13 124

                        merli.emoeke@kendle.com

 

                        Edith Heimsch

                        Stefan-George-Ring 6

                        81929 München

                        Tel. +49 (0) 89 / 99 39 13 337

                        Fax +49 (0) 89 / 99 39 13 124

                        heimsch.edith@kendle.com

 

                        Dr. Elke Sennewald

                        Stefan-George-Ring 6

                        81929 München

                        Tel. +49 (0) 89 / 99 39 13 125

                        Fax +49 (0) 89 / 99 39 13 124

                        sennewald.elke@kendle.com

 

 


Abstract

 

In its Critical Path Initiative, the FDA underlines the urgent need for a standardized approach to capture, receive and analyze clinical study data. Providing the data definition document in a machine-readable format such as XML increases the level of automation and improves the efficiency of the regulatory review process. Case Report Tabulation Data Definition Specification (CRT DDS) is the CDISC standard for providing metadata in XML format for an electronic submission to regulatory authorities such as the FDA.

This paper describes how we have implemented CRT DDS (commonly known as define.xml) Standards Version 1.0.0 at Kendle. Furthermore, it demonstrates how Kendle uses the SAS® based tool Definedoc™ from Meta-Xceed, Inc. (MXI)12 in the define.xml generation process. Quality control processes developed by Kendle are also discussed, and some interesting features of define.xml and issues encountered with the CDISC guidelines are highlighted.

 


Keywords

 

CDISC

define.xml

Definedoc™

metadata

 


Introduction

 

 

After the FDA’s announcement in its Study Data Specifications11 document that define.xml is the preferred submission standard for SDTM metadata, define.xml became the most important format for SDTM metadata submission, replacing define.pdf.

 

“The specification for the data definitions for datasets provided using the CDISC SDTM is included in the Case Report Tabulation Data Definition Specification (define.xml) developed by the CDISC define.xml Team. The latest release of the Case Report Tabulation Data Definition Specification is available from the CDISC web site (http://www.cdisc.org/models/def/v1.0/index.html). Include a reference to the style sheet as defined in the specification and place the corresponding style sheet in the same folder as the define.xml file.” 11

 

But it is not only the FDA’s preference that makes define.xml the format of choice. The main advantage of the XML format is that it is both machine and human readable. Moreover, the XML format is platform independent, which facilitates data transfer between many kinds of systems.

 

 

However, XML is new to the industry and, without any XML experience, hard to understand. The first define.xml example1 published by CDISC in 2007 provided a very good first insight into what define.xml is and demonstrated how the standard works.

In 2008, CDISC published the SDTM/ADaM Pilot Project7, a collaborative project between CDISC, FDA and the industry to assess how the define.xml standard can be implemented in a real study.

 

Besides the CDISC define.xml guideline5, both the CDISC 2007 define.xml example1 and the SDTM/ADaM Pilot Project7 were very useful for Kendle while implementing a process for generating define.xml.


Define.xml Generation

 

There are many ways to create define.xml files. Basically, any format that provides some kind of interface between study data, including the corresponding metadata, and XML can be used to store data and metadata and to derive a define.xml file from them.

One option is to use an entirely XML based solution, where both data and metadata are stored in XML format. If all mapping and analysis processes use SAS®, however, this is apparently not the most efficient option, since handling XML files in SAS® is not very convenient.

Another option is to use a SAS® based solution. Without any additional tools, though, one needs to be quite familiar with the basic XML concepts and structure to ensure correctness and completeness of the define.xml output created by SAS®.

 

This paper outlines an enriched SAS® based approach developed at Kendle that uses SAS® to provide all necessary data and metadata and a commercially available SAS® based software called Definedoc™ to convert this information into XML. While the focus is on the creation of define.xml files for SDTM data, the same approach can be used for ADaM data.

 

The starting point for this SAS® based approach is the specification document stored as an Excel® file.

<Insert Figure 1 here, half page width>

Figure 1 shows a flow chart of the define.xml generation process implemented at Kendle.

 

Step 1 – Specification Document

In a first step, the specification document containing all metadata is set up. Based on this specification document, the SDTM domains are created in SAS® and SDTM validation checks are performed to check for inconsistencies between the CDISC SDTM Guideline and the SDTM domains. The specification document and the SAS® domains are the basis for the define.xml file, which is created by using Definedoc™ and supplemental SAS® macros and programs. Further checks are applied to ensure consistency between data, metadata and define.xml. All necessary steps and checks are discussed in more detail below.

<Insert Figure 2 here, half page width>

Figure 2 illustrates the four steps of the define.xml generation process with Definedoc™ and the supplemental SAS® programs.

 

First step

 

In a first step, Definedoc™ is used to extract and store all metadata available in the SAS® domains. Figure 3 shows the main screen (data definition screen) of Definedoc™, which outlines all information on the project, including study, SAS® library names, input SAS® datasets, paths, output directories, output files as well as variable and dataset order.

<Insert Figure 3 here, half page width>

In a second screen (information screen), general study information, e.g. company name, product name, protocol number, XPT file location and annotated CRF location, can be entered. An example of the general information screen is shown in figure 4.

<Insert Figure 4 here, half page width>

By running the Definedoc™ software, the following information is extracted from the SAS® domains:

·   List of domains (sorted by class and domain names, as defined by CDISC1)

·   List of variables (order of appearance same as in domains)

·   Variable names

·   Variable labels

·   Variable length

·   Number of variables

·   Number of records
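
The kind of information collected in this first pass can be illustrated with a small Python sketch. It is not how Definedoc™ works internally (that is not described here); the classes `Variable` and `Domain` and the function `extract_metadata` are hypothetical names chosen for illustration, and the domains are sorted by name only, without the CDISC class ordering.

```python
from dataclasses import dataclass, field


@dataclass
class Variable:
    name: str
    label: str
    length: int


@dataclass
class Domain:
    name: str
    variables: list                               # Variable objects in dataset order
    records: list = field(default_factory=list)   # one dict per record


def extract_metadata(domains):
    """Collect dataset-level and variable-level metadata from the domains."""
    dataset_rows, variable_rows = [], []
    # Sorted by domain name only; ordering by SDTM class is not shown here.
    for dom in sorted(domains, key=lambda d: d.name):
        dataset_rows.append({
            "domain": dom.name,
            "n_variables": len(dom.variables),
            "n_records": len(dom.records),
        })
        # Keep the variables in their order of appearance in the domain.
        for order, var in enumerate(dom.variables, start=1):
            variable_rows.append({
                "domain": dom.name,
                "order": order,
                "variable": var.name,
                "label": var.label,
                "length": var.length,
            })
    return dataset_rows, variable_rows


# Toy input: two small domains with made-up lengths and records.
dm = Domain("DM",
            [Variable("STUDYID", "Study Identifier", 12),
             Variable("USUBJID", "Unique Subject Identifier", 20)],
            records=[{"STUDYID": "ABC", "USUBJID": "ABC-001"}])
ae = Domain("AE", [Variable("AETERM", "Reported Term for the Adverse Event", 200)])
datasets, variables = extract_metadata([dm, ae])
```

The two returned lists correspond loosely to the dataset and variable sections of the working files.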


A set of working files (SAS® datasets _define.sas7bdat and corresponding backup and audit trail files) containing the information described above is created, as shown in figure 5.

<Insert Figure 5 here, half page width>

These working files are used to create the structure of the define.xml file. Furthermore, a stylesheet (define.xsl) is automatically generated by Definedoc™.


An example of the output of this first step is shown in figure 6.

<Insert Figure 6 here, full page width>

 

Second step

Step 2 – Automate Data Attributes

To include all additional information needed for define.xml, SAS® programs and macros were developed at Kendle; these are applied in the second step of define.xml generation. The SAS® datasets generated by Definedoc™ (see figure 5) are enhanced with input from the specification document by merging the following information into the datasets:

·   Domain description

·   Domain structure

·   Domain purpose

·   Domain class

·   Domain keys

·   Variable controlled terms or formats

·   Variable origin

·   Variable role

·   Comments

·   Value level metadata

·   Nested variables

·   Code lists

·   Repeating attribute
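
A minimal Python sketch of such a merge is given below. It only illustrates the idea; the Kendle implementation consists of SAS® programs and macros that are not reproduced here. The function name `merge_spec` and the row layout (dictionaries keyed by `domain` and `variable`) are assumptions made for this example.

```python
def merge_spec(extracted, spec):
    """Add specification attributes to the extracted variable metadata.

    Both arguments are lists of dicts that contain at least the keys
    'domain' and 'variable'. Returns the merged rows and a list of
    consistency problems found while merging.
    """
    spec_by_key = {(r["domain"], r["variable"]): r for r in spec}
    merged, problems = [], []
    for row in extracted:
        key = (row["domain"], row["variable"])
        spec_row = spec_by_key.pop(key, None)
        if spec_row is None:
            problems.append(f"{key[0]}.{key[1]}: in data but not in specification")
            merged.append(dict(row))
            continue
        if "label" in spec_row and spec_row["label"] != row["label"]:
            problems.append(f"{key[0]}.{key[1]}: label differs between data and specification")
        # Attributes already taken from the data are kept; only new ones are added.
        extra = {k: v for k, v in spec_row.items() if k not in row}
        merged.append({**row, **extra})
    # Whatever is left in the specification has no counterpart in the data.
    for domain, variable in spec_by_key:
        problems.append(f"{domain}.{variable}: in specification but not in data")
    return merged, problems


extracted = [
    {"domain": "LB", "variable": "LBTESTCD", "label": "Lab Test or Examination Short Name", "length": 8},
    {"domain": "LB", "variable": "LBCAT", "label": "Category for Lab Test", "length": 20},
]
spec = [
    {"domain": "LB", "variable": "LBTESTCD", "label": "Lab Test or Examination Short Name",
     "origin": "CRF", "role": "Topic"},
    {"domain": "LB", "variable": "LBCAT", "label": "Category for Laboratory Test",
     "origin": "Sponsor Defined", "role": "Grouping Qualifier"},
    {"domain": "LB", "variable": "LBORRES", "label": "Result or Finding in Original Units",
     "origin": "CRF", "role": "Result Qualifier"},
]
merged, problems = merge_spec(extracted, spec)
```

Reporting mismatches at merge time, instead of silently overwriting one source with the other, keeps the specification document and the domains in step.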


The structure of the specification document used is similar to the structure of the SAS® datasets. Typically it consists of three parts: dataset, variable and value level metadata.

Figure 7 shows an extract from the variable metadata of the specification document.


<Insert Figure 7 here, full page width>


The code list content is stored in the Controlled Terms or Format columns of the variable and value level sections. The column Codelist of the Excel® specification document (see figure 7) contains the name of the code list. Whenever this column is populated, Definedoc™ creates a code list for define.xml.

Information on variable nesting is stored in the column Nested Variable of the Excel® specification document (see figure 7). If a variable name (e.g. LBTESTCD) is inserted into this column, Definedoc™ captures the corresponding nested variable from the same row (e.g. LBCAT) and creates a nesting relation in the define.xml file. More details on the definition of nested variables and how they are implemented in define.xml can be found in the next chapter.

Furthermore, the ODM type of the variables is determined by SAS® algorithms searching for special character patterns in the data (SDTM domains).

Information from all sources - SAS® datasets created by Definedoc™, specification document and SDTM domains - is combined and consistency checks are run.

As a result, the Definedoc™ datasets (see figure 5) are updated to include all additional information. An example of an updated _define.sas7bdat dataset is shown in figure 8.


<Insert Figure 8 here, full page width>

 

Third step

Step 3 – Generate Updated Define.xml

Based on the datasets created in the previous step, Definedoc™ is run again to create a define.xml file which contains all information gathered so far (see figure 9).

<Insert Figure 9 here, full page width>

 

Fourth step

Step 4 – Update XML and XSL

In a final step, the define.xml and define.xsl files are slightly modified, using SAS® programs and macros, to finalize the define.xml layout. One of these modifications is the deletion of the columns ‘Number of Variables’ and ‘Number of Records’ (see figure 6) that are generated by Definedoc™ but not required by CDISC. Figure 10 shows the first page of the final define.xml file.

<Insert Figure 10 here, full page width>


Selected Define.xml Features

 

In this chapter some selected features of the final define.xml file are described in more detail.

 

SAS® types vs. ODM types

 

SAS® distinguishes between only two variable types: character and numeric. As per the CDISC ODM 1.2 guideline2, a variable can be assigned one of six different types: integer, float, date, datetime, time and text.

The easiest way to determine the type of a variable is to derive this information directly from the SAS® datasets (SDTM domains) themselves. Using regular expressions, Kendle developed a SAS® program to search for special character patterns in the data values:

1.      The type of all character variables whose name does not end with -DTC is set to ‘text’.

2.      All numeric variables are sorted into types ‘float’ or ‘integer’ depending on their number of decimal places.

3.      The type of the remaining variables (name ending with -DTC) is set to ‘date’, ‘time’, or ‘datetime’, depending on the result of the algorithm displayed in figure 11.
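
The three rules can be sketched in Python as follows. This is not the algorithm of figure 11, which is not reproduced in the text: the ISO 8601 patterns, the precedence (any datetime value makes the variable ‘datetime’, only time values make it ‘time’, otherwise ‘date’) and the function name `odm_type` are assumptions made for this illustration.

```python
import re

# Assumed ISO 8601 patterns; partial dates such as 2008 or 2008-02 count as dates.
DATETIME_RE = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}(:\d{2}(:\d{2})?)?$")
TIME_RE = re.compile(r"^T?\d{2}:\d{2}(:\d{2})?$")


def odm_type(name, sas_type, values):
    """Derive an ODM data type from the variable name, SAS type and data values.

    sas_type is 'num' or 'char'; missing values (None or '') are ignored.
    """
    values = [v for v in values if v not in (None, "")]
    if sas_type == "num":
        # Rule 2: any value with decimal places makes the variable a float.
        has_decimals = any(not float(v).is_integer() for v in values)
        return "float" if has_decimals else "integer"
    if not name.endswith("DTC"):
        # Rule 1: character variables other than -DTC are text.
        return "text"
    # Rule 3 (assumed precedence): datetime, then time, otherwise date.
    if any(DATETIME_RE.match(v) for v in values):
        return "datetime"
    if values and all(TIME_RE.match(v) for v in values):
        return "time"
    return "date"
```

For example, `odm_type("LBDTC", "char", ["2008-01-15T10:30"])` yields `datetime` under these assumptions, while a -DTC variable holding only partial dates yields `date`.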
 

<Insert Figure 11 here, full page width>

 

Nested variables

 

Nested variables are variables that are linked to each other. In most cases nested variables are category (–CAT) and result variables (–TESTCD). Each category variable value corresponds to certain result variable values, and each result variable value corresponds to one or more category variable values. Figure 12 shows the relation between the category variable LBCAT and the corresponding result variable LBTESTCD.


<Insert Figure 12 here, full page width>


In this example the variable LBCAT is linked to the value level metadata section of this variable, where all possible values of LBCAT are listed. One of these values is BIOCHEMISTRY. Clicking on the hyperlink navigates the user to a list of all LBTESTCD values for the category BIOCHEMISTRY.

While this functionality of define.xml was not implemented in the CDISC SDTM/ADaM Pilot Project7, it is described in the CDISC 2007 define.xml example1.

 

Navigation without back button

 

In cooperation with Meta-Xceed, hyperlinks were implemented to facilitate the navigation between the metadata levels (see figure 12). Clicking on the hyperlink LBCAT navigates the reviewer to the value level metadata section of LBCAT. Similarly, clicking on the header of the value level metadata (ValueList.LB.LBCAT) navigates back to LBCAT at the variable level. This allows navigation between levels without using the back button.

 


Automated code list generation

 

Information for the variables –TESTCD, –TEST, –PARMCD, –PARM, QNAM and QLABEL is stored in the value level metadata section of define.xml. For example, the information for –TESTCD variables can be found in the ‘Value’ column and for –TEST variables in the ‘Label’ column. Kendle implemented an automated process in SAS® to extract this information from the value level and create code lists for each –TESTCD and –TEST variable from it. The ‘Value’ column is presented in the ‘Code Value’ column and the ‘Label’ column is shown in the ‘Code Text’ column. See figure 13 for an example.
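
As an illustration of this derivation, the following Python sketch builds ODM CodeList elements from value level rows. It is a simplified stand-in for the SAS® process, not a reproduction of it: the function name `build_codelists`, the OID naming scheme and the row layout are hypothetical, only one code list per –TESTCD variable is built, and the elements are created without the ODM namespace for brevity.

```python
import xml.etree.ElementTree as ET


def build_codelists(value_rows):
    """Build one CodeList element per (domain, variable) from value level rows.

    Each row is a dict with the keys 'domain', 'variable', 'value' and 'label'.
    """
    grouped = {}
    for row in value_rows:
        grouped.setdefault((row["domain"], row["variable"]), []).append(row)

    codelists = []
    for (domain, variable), rows in grouped.items():
        # The OID naming scheme is an assumption made for this sketch.
        codelist = ET.Element("CodeList", OID=f"CodeList.{domain}.{variable}",
                              Name=variable, DataType="text")
        for row in rows:
            # 'Value' becomes the coded value, 'Label' becomes the decode text.
            item = ET.SubElement(codelist, "CodeListItem", CodedValue=row["value"])
            decode = ET.SubElement(item, "Decode")
            ET.SubElement(decode, "TranslatedText").text = row["label"]
        codelists.append(codelist)
    return codelists


rows = [
    {"domain": "LB", "variable": "LBTESTCD", "value": "ALT", "label": "Alanine Aminotransferase"},
    {"domain": "LB", "variable": "LBTESTCD", "value": "GLUC", "label": "Glucose"},
    {"domain": "VS", "variable": "VSTESTCD", "value": "SYSBP", "label": "Systolic Blood Pressure"},
]
codelists = build_codelists(rows)
```

Calling `ET.tostring(codelists[0], encoding="unicode")` shows the resulting XML fragment for LBTESTCD.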

<Insert Figure 13 here, half page width>

 

‘Class’ column

 

Although the ‘Class’ column is not required by the Final SDTM 3.1.1 Guideline1, Kendle includes the class column as described in the Draft SDTM 3.1.2 Guideline10 in the define.xml browser representation (see figure 10). The ItemGroupDef attribute def:Class in define.xml is used to store the class information.
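
How such an attribute might be written can be sketched in Python. The namespace URIs below are the ones we understand ODM 1.2 and define.xml 1.0 to use and should be verified against the published schemas; the function name `item_group_def` and the attribute values are illustrative only.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs; verify against the CDISC schemas before use.
ODM_NS = "http://www.cdisc.org/ns/odm/v1.2"
DEF_NS = "http://www.cdisc.org/ns/def/v1.0"
ET.register_namespace("", ODM_NS)
ET.register_namespace("def", DEF_NS)


def item_group_def(domain, sdtm_class, repeating):
    """Create an ItemGroupDef element that carries the class in def:Class."""
    return ET.Element(f"{{{ODM_NS}}}ItemGroupDef", {
        "OID": domain,
        "Name": domain,
        "Repeating": repeating,
        f"{{{DEF_NS}}}Class": sdtm_class,
    })


lb = item_group_def("LB", "FINDINGS", "Yes")
lb_xml = ET.tostring(lb, encoding="unicode")
```

The serialized element carries `def:Class="FINDINGS"`, which a stylesheet can then display as the ‘Class’ column.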

 

Adaption of SDTM define.xml for ADaMs

 

The structure of the SDTM and ADaM metadata is very similar. Both models contain three levels of metadata: dataset, variable and value level metadata. Thus, for ADaM datasets, steps one to three of the define.xml generation process can be performed in the same way as described above (see also figure 2) for SDTM metadata.


Only the stylesheet used for the browser representation of define.xml has to be adapted in step four:

·   At dataset level: SDTM column ‘Class’ is changed to ‘Documentation’ for ADaMs

·   At variable level: SDTM column ‘Comment’ is changed to ‘Source’ for ADaMs

·   At value level: SDTM columns ‘Label’, ‘Value’ and ‘Comment’ are changed to ‘Param’, ‘Paramcd’ and ‘Source/Computational Method’ respectively

·   Links to analysis results metadata (documents in PDF format) are created in the navigation bar

·   Where applicable, links to additional supplemental documentation (documents in PDF format) are created in the navigation bar

When adapting SDTM define.xml for ADaMs no checks against the CDISC schemas (define1-0-0.xsd, etc.) can be performed, as these apply to SDTM only.

 


Consistency Checks

 

During define.xml generation it is important to validate not only the syntax but also the content and the semantics of the XML file. Therefore, several check mechanisms were implemented by Kendle to ensure correctness and CDISC compliance.

 

SDTM validation checks

 

First of all, SDTM data checks are performed to evaluate adherence to the SDTM guidelines. They are based on the WebSDM V1.5 edit checks as published on the CDISC standards website3 and on the FDA Draft Specifications for SDTM Validation Criteria4.

 

Additional SDTM validation checks developed by Kendle

 

To supplement the SDTM validation checks as published on the CDISC website, Kendle implemented the following additional SDTM validation checks:

·   Identification of variables whose label in the domain description is not consistent with the label in the SAS® dataset

·   Identification of variables defined as keys in the (study specific) description that do not uniquely identify records in the SAS® datasets

·   Identification of variables whose role in the domain description is not consistent with the role in the (study specific) description file

·   Identification of columns with null (empty) values for which the (Standard) Core attribute is 'Perm'

·   Identification of domain tables in which the order of the variables in the SAS® dataset is not consistent with the (study specific) description file

·   Identification of variables whose length in the (study specific) domain description is not consistent with the variable length in the SAS® dataset

·   Identification of null (empty) values in a column for which the (study specific) Core attribute is 'Req' for new domains (X-)

·   Identification of values that are not unique in the (study specific) domain description per category and subcategory
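
Three of these checks are sketched below in Python to show their general shape. The Kendle checks are SAS® programs; the function names, the record layout (one dictionary per record) and the reading of an "empty column" as a column without any populated value are assumptions made for this example.

```python
from collections import Counter


def duplicate_keys(records, keys):
    """Return the key combinations that occur in more than one record."""
    counts = Counter(tuple(rec.get(k) for k in keys) for rec in records)
    return [combo for combo, n in counts.items() if n > 1]


def label_mismatches(spec_labels, dataset_labels):
    """Return variables whose specification label differs from the dataset label."""
    return sorted(var for var, label in spec_labels.items()
                  if var in dataset_labels and dataset_labels[var] != label)


def empty_columns(records, variables):
    """Return the variables that have no populated value in any record."""
    return [var for var in variables
            if all(rec.get(var) in (None, "") for rec in records)]


# Toy adverse event records; the second record repeats the key of the first.
ae_records = [
    {"USUBJID": "001", "AESEQ": 1, "AETERM": "HEADACHE", "AEMODIFY": ""},
    {"USUBJID": "001", "AESEQ": 1, "AETERM": "NAUSEA", "AEMODIFY": ""},
    {"USUBJID": "002", "AESEQ": 1, "AETERM": "FATIGUE", "AEMODIFY": ""},
]
key_problems = duplicate_keys(ae_records, ["USUBJID", "AESEQ"])
empty = empty_columns(ae_records, ["AETERM", "AEMODIFY"])
```

Each function returns the offending items instead of a pass/fail flag, so that the findings can be listed in a check report.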

 

Define.xml syntax checks

 

In addition to the SDTM validation checks, the define.xml syntax is validated against the CDISC schemas published on the CDISC define.xml5 and ODM 1.2.1 standards website6 to ensure schema compliance. The schemas are connected to each other as shown in figure 14.


<Insert Figure 14 here, half page width>

 

Define.xml content checks

 

Besides the syntax checks, the correctness of the define.xml file content is of major importance. An extract of the checklist applied by Kendle is shown below:

·   Identification of variables for which the type in domains and define.xml does not match CDISC ODM 1.2 types

·   Identification of variables or values for which the column ‘Controlled Terms or Format’ of define.xml does not contain code list links

·   Identification of values for which nested variables are not appropriate

·   Identification of values that occur in more than one category (nested variables) and check of correctness

·   Identification of value lists with more than one code list name

·   Identification of code list names with more than one value list

·   Identification of variables which occur in more than one domain and where the variable labels do not match

·   Identification of variables which occur in more than one domain and where the variable lengths do not match

·   Identification of variables for which ‘Core’ column does not equal REQ, PERM or EXP

·   Identification of duplicate values for domain, variable, value, category and subcategory

·   Identification of domains that occur on variable metadata level but are not listed on dataset level

·   Identification of variables that occur on value level but are not listed on variable level

·   Identification of variables that are defined as nested variables but do not exist on variable metadata level

·   Identification of implausible nested variables where the strings ‘CAT’ and ‘TESTCD’ cannot be found in the variable name

·   Identification of variables for which the content of the Excel® specification document does not match the content of the Definedoc™ generated SAS® datasets

·   Identification of variables (–DUR, –DTC, –ELTM, –EVLINT) for which ISO 8601 is not correctly set in ‘Controlled Terms or Format’ column
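
A few of these content checks can be expressed compactly, as the following Python sketch shows. It assumes that the variable and value level metadata are available as lists of dictionaries; the function names and key names are hypothetical and do not reflect the actual Kendle SAS® programs.

```python
from collections import defaultdict

VALID_CORE = {"REQ", "PERM", "EXP"}


def value_level_orphans(variable_rows, value_rows):
    """Variables that occur on value level but are not listed on variable level."""
    known = {(r["domain"], r["variable"]) for r in variable_rows}
    return sorted({(r["domain"], r["variable"]) for r in value_rows} - known)


def inconsistent_labels(variable_rows):
    """Variables that occur in several domains with differing labels."""
    labels = defaultdict(set)
    for row in variable_rows:
        labels[row["variable"]].add(row["label"])
    return sorted(var for var, found in labels.items() if len(found) > 1)


def invalid_core(variable_rows):
    """Variables whose 'Core' entry is not REQ, PERM or EXP."""
    return [(r["domain"], r["variable"]) for r in variable_rows
            if r["core"] not in VALID_CORE]


variable_rows = [
    {"domain": "LB", "variable": "USUBJID", "label": "Unique Subject Identifier", "core": "REQ"},
    {"domain": "VS", "variable": "USUBJID", "label": "Unique Subject ID", "core": "REQ"},
    {"domain": "LB", "variable": "LBCAT", "label": "Category for Lab Test", "core": "OPT"},
]
value_rows = [{"domain": "LB", "variable": "LBTESTCD", "value": "ALT"}]
```

With this toy input, LBTESTCD is reported as missing on variable level, USUBJID as having inconsistent labels and LBCAT as having an invalid ‘Core’ entry.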

 


Obstacles

 

While developing the Kendle approach for the define.xml file creation, we encountered a few obstacles which may be of interest to those creating their own standardization approach.

 

ODM version 1.2 vs. 1.3

 

When define.xml version 1.0.0 was released in 2005, ODM 1.2 was the most recent version. Thus define.xml, being an extension to ODM, is based on ODM 1.2. In the meantime ODM was developed further and the updated ODM version 1.3 was released in 2006. However, this enhancement has not yet been considered in the define.xml schemas. These schemas still reference ODM 1.2 (see figure 15).


<Insert Figure 15 here, full page width>


As long as define.xml version 1.0.0 has not been replaced by a newer version, ODM 1.2 should be used for electronic submissions to the FDA. Otherwise, the FDA will not be able to read the define.xml file9.
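
A quick safeguard against accidentally producing an ODM 1.3 based file is to inspect the root element, as in the Python sketch below. The namespace URI and the expected `ODMVersion` value of "1.2" are assumptions to be verified against the schemas, and `odm_version_issues` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumed ODM 1.2 namespace URI; verify against the published schema.
ODM12_NS = "http://www.cdisc.org/ns/odm/v1.2"


def odm_version_issues(xml_text, expected_version="1.2"):
    """Return a list of problems with the ODM version of a define.xml document."""
    root = ET.fromstring(xml_text)
    issues = []
    # ElementTree expands a namespaced tag to '{namespace}localname'.
    if root.tag != f"{{{ODM12_NS}}}ODM":
        issues.append(f"unexpected root element {root.tag}")
    if root.get("ODMVersion") != expected_version:
        issues.append(f"ODMVersion is {root.get('ODMVersion')!r}, expected {expected_version!r}")
    return issues


good = '<ODM xmlns="http://www.cdisc.org/ns/odm/v1.2" ODMVersion="1.2"/>'
bad = '<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3" ODMVersion="1.3"/>'
```

Such a check complements, but does not replace, validation against the CDISC schemas.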

 

‘Controlled Terms or Format’ column

 

There are discrepancies between the CDISC SDTM guideline1 and the CDISC define.xml guideline5 with respect to the use of the ‘Controlled Terms or Format’ column. On the one hand, the CDISC SDTM guideline states that the content of this column could be any kind of text. On the other hand, the define.xml guideline only allows for the use of code lists in this column. Thus, the user is forced to include all information in the form of code lists9.

 

‘Repeating’ attribute

 

The CDISC ODM and define.xml guidelines do not clearly define the use of the ‘Repeating’ attribute, especially for trial design domains such as TA, TE, TI, TS and TV. From the authors’ point of view, the most logical approach is to define this attribute as follows9:

·   Repeating=“Yes” for domains with more than one record per unique subject identifier

·   Repeating=“No” for domains with one record per unique subject identifier

·   Repeating=“No” for all trial design domains
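
Expressed as a small Python function, the rule reads as follows. The set of trial design domains is taken from the list above; the function name `repeating_attribute` and the use of a USUBJID key in each record are assumptions made for this sketch.

```python
from collections import Counter

TRIAL_DESIGN_DOMAINS = {"TA", "TE", "TI", "TS", "TV"}


def repeating_attribute(domain, records):
    """Derive the value of the 'Repeating' attribute for a domain.

    records is a list of dicts; subject level domains carry a 'USUBJID' key.
    """
    if domain in TRIAL_DESIGN_DOMAINS:
        return "No"
    # "Yes" as soon as any subject has more than one record.
    per_subject = Counter(rec["USUBJID"] for rec in records)
    return "Yes" if any(n > 1 for n in per_subject.values()) else "No"
```

For example, a DM domain with one record per subject yields "No", whereas an AE domain in which a subject has two adverse events yields "Yes".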

 


Conclusion

 

When generating define.xml for SDTM datasets, not only the define.xml guideline but also the SDTM and ODM guidelines are essential. The SDTM guideline describes the structure of the data (SDTM domains) and thus is the basis for the documentation of metadata. Besides many other features, ODM stores the metadata pertaining to the SDTM domains, and define.xml is an extension to ODM which enriches it with many elements and attributes.

As the ODM and SDTM guidelines have evolved significantly since the last define.xml release in 2005, an update of the define.xml guideline is more than overdue.

The FDA Study Data Specifications11 allow metadata to be submitted as a define.xml file and SDTM data as SAS® XPT files. The next logical step would be to also allow SDTM data to be submitted in XML format. This way, both data and metadata could be submitted in a single format, thus simplifying the whole submission process for industry and regulatory agencies alike.

It is also worth mentioning that the demand for a define.xml ADaM guideline has been growing steadily since the publication of the CDISC SDTM/ADaM Pilot Project7.

With the introduction of the HL7-XML standard, however, further developments of define.xml remain to be seen.

 


Acknowledgements

 

The authors would like to thank Sy Truong from Meta-Xceed, Inc. (MXI) for his contribution and continued support during the define.xml implementation process.

 


References

 

1.      CDISC Metadata Submission Guidelines, Appendix to the Study Data Tabulation Model (SDTM) Implementation Guide 3.1.1, Draft Version 0.9, 25.Jul.2007 (http://www.cdisc.org/models/sdtm/v1.1/index.html)

2.      CDISC Specification for the Operational Data Model (ODM), Version 1.2, January 2004 (http://www.cdisc.org/models/odm/v1.2/ODM1-2-0.html)

3.      Validation checks performed by WebSDM™ on SDTM version 3.1.1 datasets, Version 1.5, 12.Apr.2007 (http://www.phaseforward.com/products/safety/documents/ValidationChecksPerformedbyWebSDMtm.Q107.pdf)

4.      FDA Draft Specifications for SDTM Validation Criteria (v3.1, v3.1.1), Version 0.1, 01.Sep.2008 (http://www.fda.gov/oc/datacouncil/janus_sdtm_validation_specification_v1.pdf)

5.      CDISC Case Report Tabulation Data Definition Specification (CRT-DDS, also called define.xml), Version 1.0, 10.Feb.2005 (http://www.cdisc.org/models/def/v1.0/index.html)

6.      CDISC Operational Data Model, Final Version 1.2.1, January 2005 (http://www.cdisc.org/models/odm/v1.2.1/index.html)

7.      CDISC SDTM/ADaM Pilot Project (http://www.cdisc.org/membersonly/members_sdtm.html)

8.      CDISC Operational Data Model, Final Version 1.3, latest change date 19.Dec.2006 (http://www.cdisc.org/models/odm/v1.3/final/ODM1-3-0-Final.htm)

9.      CDISC Public Discussion Forum (http://www.cdisc.org/discussions/index.html)

- Case Report Tabulation (CRT-DDS or define.xml)

Thread: ODM 1.2 or ODM 1.3, October 2008

Thread: Controlled Terms or Format Column, October 2008

- ODM V1.3 Final

Thread: Repeating Attribute, October 2008

10.  CDISC Study Data Tabulation Model (SDTM) Implementation Guide, Draft Version 3.1.2, 10.Jul.2007 (http://www.cdisc.org/models/sdtm/v1.2/index.html)

11.  FDA Study Data Specifications, Version 1.4, 01.Aug.2007 (http://www.fda.gov/CDER/regulatory/ersr/Studydata.pdf)

12.  Meta-Xceed, Inc. (MXI) Home (http://meta-x.com/definedoc/index.html)

 


Figures

 

Figure 1: Flow chart for define.xml generation process

 

 


Figure 2: Definedoc™, supplemental SAS® programs and define.xml checks

 

 


Figure 3: Definedoc™ – Data Definition Screen

 

 


Figure 4: Definedoc™ – General Information Screen

 

 


Figure 5: List of working files

 

 


Figure 6: First step of define.xml generation, Definedoc™ output, structure of the define.xml file

 

 


Figure 7: Variable metadata of the specification document

 

 


Figure 8: Second step of define.xml generation, result of SAS® program run; all metadata included in a single SAS® dataset

 

 


Figure 9: Third step of define.xml generation, Definedoc™ output with all metadata included

 

 


Figure 10: Fourth step of define.xml generation, final define.xml

 

 


Figure 11: SAS® algorithm for type determination

 

 


Figure 12: Nested variables and hyperlinks

 

 


Figure 13: Automated code list generation

 

 


Figure 14: Relation between CDISC define.xml and ODM schemas

 

 


Figure 15: Reference from define.xml schemas to ODM 1.2.1 schemas.